Modified Makagonov’s Method for Testing Word Similarity and its Application to Constructing Word Frequency Lists

Authors

  • Xavier Blanco
  • Mikhail Alexandrov
  • Alexander Gelbukh
Abstract

By (morphologically) similar wordforms we understand wordforms (strings) that have the same base meaning (roughly, the same root), such as sadly and sadden. The task of deciding whether two given strings are similar in this sense has numerous applications in text processing, e.g., in information retrieval, where stemming is usually employed as an intermediate step. Makagonov has suggested a weakly supervised approach to testing word similarity, based on empirical formulae that compare the number of equal and different letters in the two strings. This method gives good results for English, Russian, and a number of Romance languages. However, it does not deal well with slight morphological alterations in the stem, such as Spanish pensar vs. pienso. We propose a simple modification of the method that uses n-grams instead of letters. We also consider four algorithms for compiling a word frequency list that rely on these formulae. Examples from Spanish and English are presented.
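
The idea of comparing two wordforms through character n-grams rather than single letters can be sketched as follows. The abstract does not give the empirical formulae, so this sketch does not reproduce them: it substitutes a plain Dice coefficient over character bigrams as a stand-in measure. The function names `char_ngrams` and `ngram_similarity` and the default `n=2` are assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Return the overlapping character n-grams of a word, lowercased."""
    word = word.lower()
    if len(word) < n:
        # Words shorter than n are treated as a single n-gram.
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_similarity(a, b, n=2):
    """Dice coefficient over n-gram multisets.

    This is a stand-in measure, NOT the empirical formulae of the paper.
    """
    grams_a = Counter(char_ngrams(a, n))
    grams_b = Counter(char_ngrams(b, n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((grams_a & grams_b).values())
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2.0 * shared / total


if __name__ == "__main__":
    for pair in [("sadly", "sadden"), ("pensar", "pienso"), ("sadly", "table")]:
        print(pair, round(ngram_similarity(*pair), 3))
```

Any decision threshold applied to such a score would have to be tuned on data; none is suggested here because the source does not state one.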

Similar resources

Vocabulary Lists for EAP and Conversation Students

Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...

English Vocabulary for Equine Veterans: How Different from GSL and AWL Words

ESP students are usually advised to master general and academic word lists such as West’s (1953) General Service List (GSL) and Coxhead’s (2000) Academic Word List (AWL) in order to read their academic texts. However, it seems that university students may not need to learn all the words in the two lists, as some of the words occur infrequently in academic texts. Moreover, there are ...

Knowledge-poor Approach to Constructing Word Frequency Lists, with Examples from Romance Languages

Word frequency lists extracted from documents are widely used in many text clustering and categorization procedures. Such lists are usually compiled with morphology-based approaches (such as the Porter stemmer) that join words having the same base meaning. However, such an approach needs many language-dependent linguistic resources or knowledge when working with multilingual da...

Knowledge-poor Approach to Constructing Word Frequency Lists, with Example from Romance Languages

Word frequency lists extracted from documents are widely used in many text clustering and categorization procedures. Such lists are usually compiled with morphology-based approaches (such as the Porter stemmer) that join words having the same base meaning. However, such an approach needs many language-dependent linguistic resources or knowledge when working with multilingual da...

Do We Need Discipline-Specific Academic Word Lists? Linguistics Academic Word List (LAWL)

This corpus-based study aimed to explore the most frequently used academic words in linguistics and to compare the resulting word list with the distribution of high-frequency words in Coxhead’s Academic Word List (AWL) and West’s General Service List (GSL), examining their coverage within the linguistics corpus. To this end, a corpus of 700 linguistics research articles (LRAC), consisting of approximately ...

Publication date: 2006